Abstract



This paper proposes the use of "pattern-based"
context-free grammars as a basis for building
machine translation (MT) systems, which are now
being adopted as personal tools by a broad range
of users in the cyberspace society. We discuss
major requirements for such tools, including easy
customization for diverse domains, the efficiency
of the translation algorithm, and scalability
(incremental improvement in translation quality
through user interaction), and describe how our
approach meets these requirements.



1 Introduction

With the explosive growth of the World-Wide Web
(WWW) as an information source, it has become
routine for Internet users to access textual data
written in foreign languages. In Japan, for example,
a dozen or so inexpensive MT tools have recently
been put on the market to help PC users understand
English text in WWW home pages. The MT techniques
employed in the tools, however, are fairly
conventional. For reasons of affordability, their
designers appear to have made no attempt to tackle
the well-known problems in MT, such as how to ensure
the learnability of correct translations and
facilitate customization. As a result, users are
forced to see the same kinds of translation errors
over and over again, except in cases where they
involve merely adding a missing word or compound to
a user dictionary or specifying one of several
word-to-word translations as a correct choice.

There are several alternative approaches that might
eventually liberate us from this limitation on the
usability of MT systems:

Unification-based grammar formalisms and
lexical-semantics formalisms (see LFG (Kaplan and
Bresnan, 1982), HPSG (Pollard and Sag, 1987), and
Generative Lexicon (Pustejovsky, 1991), for example)
have been proposed to facilitate computationally
precise description of natural-language syntax and
semantics. It is possible that, with the descriptive
power of these grammars and lexicons, individual
usages of words and phrases may be defined
specifically enough to give correct translations.
Practical implementation of MT systems based on
these formalisms, on the other hand, would not be
possible without much more efficient parsing and
disambiguation algorithms for these formalisms and a
method for building a lexicon that is easy even for
novices to use.

Corpus-based or example-based MT (Sato and Nagao,
1990; Sumita and Iida, 1991) and statistical MT
(Brown et al., 1993) systems provide the easiest
customizability, since users have only to supply a
collection of source and target sentence pairs (a
bilingual corpus). Two open questions, however, have
yet to be satisfactorily answered before we can
confidently build commercial MT systems based on
these approaches:

• Can the system be used for various domains
without showing severe degradation of translation
accuracy?

• What is the minimum number of examples (or
training data) required to achieve reasonable MT
quality for a new domain?

TAG-based MT (Abeillé, Schabes, and Joshi, 1990)[1]
and pattern-based translation (Maruyama, 1993) share
many important properties for successful
implementation in practical MT systems, namely:

• The existence of a polynomial-time parsing
algorithm

• A capability for describing a larger domain of
locality (Schabes, Abeillé, and Joshi, 1988)

• Synchronization (Shieber and Schabes, 1990) of
the source and target language structures

1 See LTAG (Schabes, Abeillé, and Joshi, 1988)
(Lexicalized TAG) and STAG (Shieber and Schabes,
1990) (Synchronized TAG) for each member of the TAG
(Tree Adjoining Grammar) family.

Readers should note, however, that the parsing
algorithm for TAGs has $O(|G|n^6)$[2] worst-case
time complexity (Vijay-Shanker, 1987), and that the
"patterns" in Maruyama's approach are merely
context-free grammar (CFG) rules. Thus, it has been
a challenge to find a framework in which we can
enjoy both a grammar formalism with better
descriptive power than CFG and more efficient
parsing/generation algorithms than those of TAGs.[3]

In this paper, we will show that there exists a 
class of "pattern-based" grammars that is weakly 
equivalent to CFG (thus allowing the CFG parsing 
algorithms to be used for our grammars), but that 
it facilitates description of the domain of locality. 
Furthermore, we will show that our framework can 
be extended to incorporate example-based MT and 
a powerful learning mechanism. 

2 Pattern-Based Context-Free 
Grammars 

A pattern-based context-free grammar (PCFG) consists
of a set of translation patterns. A pattern is a
pair of CFG rules, and zero or more syntactic head
and link constraints for nonterminal symbols. For
example, the English-French translation pattern[4]

NP:1 miss:V:2 NP:3 -> S:2
S:2 <- NP:3 manquer:V:2 à NP:1

essentially describes a synchronized[5] pair
consisting of a left-hand-side English CFG rule
(called a source rule)

NP V NP -> S

and a French CFG rule (called a target rule)

S <- NP V à NP

accompanied by the following constraints.

2 Where |G| stands for the size of grammar G, and n
is the length of an input string.

3 Lexicalized CFG, or Tree Insertion Grammar (TIG)
(Schabes and Waters, 1995), has been recently
introduced to achieve such efficiency and
lexicalization.

4 and its inflectional variants; we will discuss
inflections and agreement issues later.

5 The meaning of the word "synchronized" here is
exactly the same as in STAG (Shieber and Schabes,
1990). See also bilingual signs (Tsujii and Fujita,
1991) for a discussion of the importance of
combining the appropriate domain of locality and
synchronization.



1. Head constraints: The nonterminal symbol V in
the source rule must have the verb miss as a
syntactic head. The symbol V in the target rule
must have the verb manquer as a syntactic head.
The head of symbol S in the source (target) rule
is identical to the head of symbol V in the source
(target) rule, as they are co-indexed.

2. Link constraints: Nonterminal symbols in the
source and target CFG rules are linked if they are
given the same index. Linked nonterminals must be
derived from a sequence of synchronized pairs.
Thus, the first NP (NP:1) in the source rule
corresponds to the second NP (NP:1) in the target
rule, the Vs in both rules correspond to each
other, and the second NP (NP:3) in the source rule
corresponds to the first NP (NP:3) in the target
rule.

The source and target rules are called the CFG
skeleton of the pattern. The notion of a syntactic
head is similar to that used in unification
grammars, although the heads in our patterns are
simply encoded as character strings rather than as
complex feature structures. A head is typically
introduced[6] in preterminal rules such as

leave -> V    V <- partir

where two verbs, "leave" and "partir," are
associated with the heads of the nonterminal symbol
V. This is equivalently expressed as

leave:1 -> V:1    V:1 <- partir:1

which is physically implemented as an entry of an
English-French lexicon.

6 A nonterminal symbol X in a source or target CFG
rule $X \leftarrow X_1 \cdots X_k$ can only be
constrained to have one of the heads in the RHS
$X_1 \cdots X_k$. Thus, monotonicity of head
constraints holds throughout the parsing process.

A set T of translation patterns is said to accept
an input s iff there is a derivation sequence Q for
s using the source CFG skeletons of T, and every
head constraint associated with the CFG skeletons
in Q is satisfied. Similarly, T is said to
translate s iff there is a synchronized derivation
sequence Q for s such that T accepts s, and every
head and link constraint associated with the source
and target CFG skeletons in Q is satisfied. The
derivation Q then produces a translation t as the
resulting sequence of terminal symbols included in
the target CFG skeletons in Q. Translation of an
input string s essentially consists of the
following three steps:

1. Parsing s by using the source CFG skeletons

2. Propagating link constraints from source to
target CFG skeletons to build a target CFG
derivation sequence

3. Generating t from the target CFG derivation
sequence

The third step is a trivial procedure once the
target CFG derivation is obtained.
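
To illustrate why the third step is simple, the following minimal sketch reads a translation off an already-built target derivation. The `Node` structure, the function name `generate`, and the lexical choices "Marie", "Jean", and "manque" are assumptions made for this example only; they are not part of the formalism described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Node:
    """One step of a target derivation (hypothetical structure)."""
    # Target side of the pattern: terminal strings, or link indexes
    # that refer to entries of `children`.
    target: List[Union[str, int]]
    children: Dict[int, "Node"] = field(default_factory=dict)


def generate(node: Node) -> List[str]:
    """Collect the target terminals of a derivation from left to right."""
    out: List[str] = []
    for item in node.target:
        if isinstance(item, int):
            out.extend(generate(node.children[item]))  # follow the link
        else:
            out.append(item)                           # plain terminal
    return out


# Target sides of three preterminal pairs (lexical choices are assumed).
mary = Node(target=["Marie"])
john = Node(target=["Jean"])
miss = Node(target=["manque"])

# NP:1 miss:V:2 NP:3 -> S:2   with   S:2 <- NP:3 manquer:V:2 à NP:1
s = Node(target=[3, 2, "à", 1], children={1: mary, 2: miss, 3: john})

print(" ".join(generate(s)))  # Jean manque à Marie
```

The reordering of the two NPs comes entirely from the order of link indexes on the target side, which is why no search is needed at this stage.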

Theorem 1 Let T be a PCFG. Then, there exists a CFG
$G_T$ such that for two languages $L(T)$ and
$L(G_T)$ accepted by T and $G_T$, respectively,
$L(T) = L(G_T)$ holds. That is, T accepts a sentence
s iff $G_T$ accepts s.

Proof: We can construct a CFG $G_T$ as follows:

1. $G_T$ has the same set of terminal symbols as T.

2. For each nonterminal symbol X in T, $G_T$
includes a set of nonterminal symbols
$\{X_w \mid w$ is either a terminal symbol in T or a
special symbol $\epsilon\}$.

3. For each preterminal rule

$X{:}i \leftarrow w_1{:}1 \; w_2{:}2 \; \cdots \; w_k{:}k \quad (1 \le i \le k)$,

$G_T$ includes[7]

$X_{w_i} \leftarrow w_1 \; w_2 \; \cdots \; w_k \quad (1 \le i \le k)$.

If X is not co-indexed with any of $w_i$, $G_T$
includes

$X_\epsilon \leftarrow w_1 \; w_2 \; \cdots \; w_k$.

4. For each source CFG rule with head constraints
$(h_1, h_2, \ldots, h_k)$ and indexes
$(i_1, i_2, \ldots, i_k)$,

$Y{:}i_j \leftarrow h_1{:}X_1{:}i_1 \; \cdots \; h_k{:}X_k{:}i_k \quad (1 \le j \le k)$,

$G_T$ includes

$Y_{h_j} \leftarrow X_{h_1} \; X_{h_2} \; \cdots \; X_{h_k}$.

If Y is not co-indexed with any of its children, we
have

$Y_\epsilon \leftarrow X_{h_1} \; X_{h_2} \; \cdots \; X_{h_k}$.

If $X_j$ has no head constraint in the above rule,
$G_T$ includes a set of $(N + 1)$ rules, where
$X_{h_j}$ above is replaced with $X_w$ for every
terminal symbol w and $X_\epsilon$ ($Y_{h_j}$ will
also be replaced if it is co-indexed with $X_j$).[8]

7 Head constraints are trivially satisfied or
violated in preterminal rules. Hence, we assume,
without loss of generality, that no head constraint
is given in preterminal rules. We also assume that
"X <- w" implies "X:1 <- w:1".

Now, $L(T) \subseteq L(G_T)$ is obvious, since $G_T$
can simulate the derivation sequence in T with
corresponding rules in $G_T$.
$L(G_T) \subseteq L(T)$ can be proven, with
mathematical induction, from the fact that every
valid derivation sequence of $G_T$ satisfies head
constraints of corresponding rules in T.

□
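
Step 4 of the construction can be made concrete with a small sketch. It expands one head-constrained source rule into plain CFG rules over head-annotated nonterminals, where a child without a head constraint ranges over all N terminals plus the special symbol. The function name `expand_rule`, the tuple encoding of annotated symbols, and the three-word terminal set are assumptions made for illustration.

```python
from itertools import product

EPS = "ε"  # the special symbol meaning "no head"


def expand_rule(lhs, children, head_pos, terminals):
    """Expand one head-constrained source rule into plain CFG rules.

    children  -- list of (nonterminal, head or None), left to right
    head_pos  -- position of the child co-indexed with lhs, or None
    terminals -- the terminal symbols of T (N of them)
    """
    # A constrained child keeps its head; an unconstrained child ranges
    # over N terminals plus EPS, i.e. N + 1 annotated symbols.
    choices = [[h] if h is not None else sorted(terminals) + [EPS]
               for _, h in children]
    rules = []
    for heads in product(*choices):
        lhs_head = EPS if head_pos is None else heads[head_pos]
        rhs = tuple((sym, h) for (sym, _), h in zip(children, heads))
        rules.append(((lhs, lhs_head), rhs))
    return rules


terminals = {"miss", "Mary", "John"}  # N = 3, assumed toy vocabulary

# NP miss:V NP -> S, with S co-indexed with V (position 1)
rules = expand_rule("S", [("NP", None), ("V", "miss"), ("NP", None)],
                    1, terminals)
print(len(rules))  # 16 = (N + 1)^2, one factor per unconstrained child
```

The count grows as (N + 1) raised to the number of unconstrained children, which is the source of the exponential blow-up noted for the size of the constructed grammar.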

Proposition 1 Let a CFG G be a set of source CFG
skeletons in T. Then, $L(T) \subseteq L(G)$.

Since a valid derivation sequence in T is always a
valid derivation sequence in G, the proof is
immediate. Similarly, we have

Proposition 2 Let a CFG H be a subset of source CFG
skeletons in T such that a source CFG skeleton k is
in H iff k has no head constraints associated with
it. Then, $L(H) \subseteq L(T)$.

The two CFGs G and H thus bound the CFL $L(T)$:
$L(H) \subseteq L(T) \subseteq L(G)$. These two CFGs
can be used to measure the "default" translation
quality, since idioms and collocational phrases are
typically translated by patterns with head
constraints.

Theorem 2 Let a CFG G be a set of source CFG
skeletons in T. Then, $L(T) \subset L(G)$ is
undecidable.

Proof: The decision problem, $L(T) \subset L(G)$, of
two CFLs such that $L(T) \subseteq L(G)$ is solvable
iff $L(T) = L(G)$ is solvable. This includes a known
undecidable problem, $L(T) = \Sigma^*$?, since we can
choose a grammar U with $L(U) = \Sigma^*$, nullify
the entire set of rules in U by defining T to be a
vacuous set
$\{S{:}1 \leftarrow a{:}S_b{:}1,\; S_b{:}1 \leftarrow b{:}S_U{:}1\} \cup U$
($S_U$ and S are start symbols in U and T,
respectively), and, finally, let T further include
an arbitrary CFG F. $L(G) = \Sigma^*$ is obvious,
since G has
$\{S \leftarrow S_b,\; S_b \leftarrow S_U\} \cup U$.
Now, we have $L(G) = L(T)$ iff $L(F) = \Sigma^*$.

□

Theorem 2 shows that the syntactic coverage of T is,
in general, only computable by T itself, even though
$L(T)$ is merely a CFL. This may pose a serious
problem when a grammar writer wishes to know if
there is a specific expression that is only
acceptable by using at least one pattern with head
constraints, for which the answer is "no" iff
$L(G) = L(T)$. One way to trivialize this problem is
to let T include a pattern with a pair of pure CFG
rules for every pattern with head constraints, which
guarantees that $L(H) = L(T) = L(G)$. In this case,
we know that the coverage of "default" patterns is
always identical to $L(T)$.

8 Therefore, a single rule in T can be mapped to as
many as $(N + 1)^k$ rules in $G_T$, where N is the
number of terminal symbols in T. $G_T$ could be
exponentially larger than T.

Although our "patterns" have no more theoretical
descriptive power than CFG, they can provide
considerably better descriptions of the domain of
locality than ordinary CFG rules. For example,

be:V:1 year:NP:2 old -> VP:1
VP:1 <- avoir:V:1 an:NP:2

can handle such NP pairs as "one year" and "un an,"
and "more than two years" and "plus que deux ans,"
which would have to be covered by a large number of
plain CFG rules. TAGs, on the other hand, are known
to be "mildly context-sensitive" grammars, and they
can capture a broader range of syntactic
dependencies, such as cross-serial dependencies. The
computational complexity of parsing for TAGs,
however, is $O(|G|n^6)$, which is far greater than
that of CFG parsing. Moreover, defining a new STAG
rule is not as easy for the users as just adding an
entry into a dictionary, because each STAG rule has
to be specified as a pair of tree structures. Our
patterns, on the other hand, concentrate on
specifying linear ordering of source and target
constituents, and can be written by the users as
easily as[9]

to leave * = de quitter * 

to be year:* old = d'avoir an:* 

Here, the wildcard "*" stands for an NP by default.
The prepositions "to" and "de" are used to specify
that the patterns are for VP pairs, and "to be" is
used to show that the phrase is the BE-verb and its
complement. A wildcard can be constrained with a
head, as in "house:*" and "maison:*". The internal
representations of these patterns are as follows:

leave:V:1 NP:2 -> VP:1
VP:1 <- quitter:V:1 NP:2

be:V:1 year:NP:2 old -> VP:1
VP:1 <- avoir:V:1 an:NP:2
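
The conversion from the user notation to the internal representation can be sketched for the two VP examples above. The function `to_internal` and its rules are assumptions inferred from those two examples only: a leading "to", "de", or elided "d'" is treated as a VP marker, the first remaining word is taken as the verb, and wildcards are linked in left-to-right order on both sides.

```python
def to_internal(user_pattern):
    """Convert 'to V ... = de V ...' user notation into a rule pair."""
    source, target = (side.split() for side in user_pattern.split("="))

    def strip_marker(tokens):
        # Assumed convention: the leading "to"/"de" (or elided "d'")
        # only marks a VP pair and is not part of the rule itself.
        if tokens[0] in ("to", "de"):
            return tokens[1:]
        if tokens[0].startswith("d'"):
            return [tokens[0][2:]] + tokens[1:]
        return tokens

    def convert(tokens):
        verb, rest = tokens[0], tokens[1:]
        out, link = [f"{verb}:V:1"], 2      # index 1 is used by the verb
        for tok in rest:
            if tok == "*":                  # bare wildcard: NP by default
                out.append(f"NP:{link}")
                link += 1
            elif tok.endswith(":*"):        # wildcard constrained by a head
                out.append(f"{tok[:-2]}:NP:{link}")
                link += 1
            else:                           # plain terminal symbol
                out.append(tok)
        return " ".join(out)

    src = convert(strip_marker(source))
    tgt = convert(strip_marker(target))
    return f"{src} -> VP:1", f"VP:1 <- {tgt}"


for rule in to_internal("to be year:* old = d'avoir an:*"):
    print(rule)
```

A fuller converter would also need explicit nonterminal annotations and a way to state links that do not follow surface order; neither is covered by this sketch.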

These patterns can be associated with an explicit
nonterminal symbol such as "V:*" or "ADJP:*" in
addition to head constraints (e.g., "leave:V:*"). By
defining a few such notations, these patterns can be
successfully converted into the formal
representations defined in this section. Many of the
divergences (Dorr, 1993) in source and target
language expressions are fairly collocational, and
can be appropriately handled by using our patterns.
Note the simplicity that results from using a
notation in which users only have to specify the
surface ordering of words and phrases. More powerful
grammar formalisms would generally require either a
structural description or complex feature
structures.

3 The Translation Algorithm 

The parsing algorithm for translation patterns can
be any of the known CFG parsing algorithms,
including the CKY and Earley algorithms.[10] At this
stage, head and link constraints are ignored. It is
easy to show that the number of target charts for a
single source chart increases exponentially if we
build target charts simultaneously with source
charts. For example, the two patterns

A:1 B:2 -> B:2    B:2 <- A:1 B:2, and
A:1 B:2 -> B:2    B:2 <- B:2 A:1

will generate the following $2^n$ synchronized pairs
of charts for the sequence of (n+1) nonterminal
symbols AAA...AB, for which no effective packing of
the target charts is possible.

(A (A ... (A B))) with (A (A ... (A B)))
(A (A ... (A B))) with ((A ... (A B)) A)
...
(A (A ... (A B))) with (((B A) A) ... A)
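
The exponential growth can be checked directly with a short enumeration. The sketch below assumes the two patterns share the source rule "A B -> B" and differ only in whether the target side keeps or swaps the order; the function name `target_trees` and the nested-tuple encoding of bracketings are choices made for this example.

```python
def target_trees(n):
    """All target-side bracketings for the source sequence A^n B."""
    trees = ["B"]
    for _ in range(n):
        # Each application of "A B -> B" may use either target rule:
        # "B <- A B" keeps the order, "B <- B A" swaps it.
        trees = [t2 for t in trees for t2 in (("A", t), (t, "A"))]
    return trees


for n in range(1, 6):
    trees = target_trees(n)
    assert len(set(trees)) == 2 ** n  # every bracketing is distinct
    print(n, len(trees))
```

All of these target structures correspond to one and the same source structure, so building them alongside the source chart multiplies the work without adding any source-side information.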



Our strategy is thus to find a candidate set of
source charts in polynomial time. We therefore apply
heuristic measurements to identify the most
promising patterns for generating translations. In
this sense, the entire translation algorithm is not
guaranteed to run in polynomial time. Practically, a
timeout mechanism and a process for recovery from
unsuccessful translation (e.g., applying the idea of
fitted parse (Jensen and Heidorn, 1983) to target
CFG rules) should be incorporated into the
translation algorithm.

Some restrictions on patterns must be imposed to
avoid infinitely many ambiguities and arbitrarily
long translations. The following patterns are
therefore not allowed:

1. A -> X    Y <- B

2. A -> X    Y <- $C_1 \ldots B \ldots C_k$

if there is a cycle of synchronized derivation such
that

A -> X ... -> A and
B (or $C_1 \ldots B \ldots C_k$) -> Y ... -> B,

where A, B, X, and Y are nonterminal symbols with or
without head and link constraints, and C's are
either terminal or nonterminal symbols.

9 By sacrificing linguistic accuracy for the
description of syntactic structures.

10 Our prototype implementation was based on the
Earley algorithm, since this does not require
lexicalization of CFG rules.

The basic strategy for choosing a candidate
derivation sequence from ambiguous parses is as
follows.[11] A simplified view of the Earley
algorithm (Earley, 1970) consists of three major
components, predict(i), complete(i), and scan(i),
which are called at each position i = 0, 1, ..., n
in an input string $I = s_1 s_2 \cdots s_n$.
Predict(i) returns a set of currently applicable CFG
rules at position i. Complete(i) combines inactive
charts ending at i with active charts that look for
the inactive charts at position i to produce a new
collection of active and inactive charts. Scan(i)
tries to combine inactive charts with the symbol
$s_{i+1}$ at position i. Complete(n) gives the set
of possible parses for the input I.
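
For readers unfamiliar with the three components, the following is a minimal Earley recognizer, a textbook-style sketch rather than the prototype's implementation. It ignores head and link constraints, only answers yes or no, and does not support empty right-hand sides. The grammar encoding and the toy grammar (plain source skeletons with lower-cased words) are assumptions made for this example.

```python
def earley_recognize(grammar, start, words):
    """Return True iff `words` can be derived from `start`.

    grammar -- dict: nonterminal -> list of right-hand sides (tuples);
               symbols that are not keys of the dict are terminals.
    """
    n = len(words)
    # chart[i] holds items (lhs, rhs, dot, origin) that end at position i.
    chart = [set() for _ in range(n + 1)]
    chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}

    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                nxt = rhs[dot]
                if nxt in grammar:
                    # predict(i): add rules for the expected nonterminal
                    new, target = [(nxt, r, 0, i) for r in grammar[nxt]], i
                elif i < n and words[i] == nxt:
                    # scan(i): consume the terminal s_{i+1}
                    new, target = [(lhs, rhs, dot + 1, origin)], i + 1
                else:
                    continue
            else:
                # complete(i): an inactive item advances the active items
                # that were waiting for its left-hand side
                new = [(l, r, d + 1, o) for (l, r, d, o) in chart[origin]
                       if d < len(r) and r[d] == lhs]
                target = i
            for item in new:
                if item not in chart[target]:
                    chart[target].add(item)
                    if target == i:
                        agenda.append(item)

    return any(l == start and d == len(r) and o == 0
               for (l, r, d, o) in chart[n])


grammar = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("VP", "ADVP")],
    "NP": [("he",), ("me",)],
    "V": [("knows",)],
    "ADVP": [("well",)],
}
print(earley_recognize(grammar, "S", "he knows me well".split()))  # True
```

A parser for translation patterns would additionally keep back-pointers so that candidate patterns can be attached to each inactive item, as described next.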

Now, for every inactive chart associated with a
nonterminal symbol X for a span of (i,j)
($1 \le i, j \le n$), there exists a set P of
patterns with the source CFG skeleton ... -> X. We
can define the following ordering of patterns in P;
this gives patterns with which we can use head and
link constraints for building target charts and
translations. These candidate patterns can be
arranged and associated with the chart in the
complete() procedure.

1. Prefer a pattern p with a source CFG skeleton
$X \leftarrow X_1 \cdots X_k$ over any other pattern
q with the same source CFG skeleton
$X \leftarrow X_1 \cdots X_k$, such that p has a
head constraint $h{:}X_i$ if q has $h{:}X_i$
($i = 1, \ldots, k$). The pattern p is said to be
more specific than q. For example, p =
"leave:V:1 house:NP -> VP:1" is preferred to q =
"leave:V:1 NP -> VP:1".

2. Prefer a pattern p with a source CFG skeleton to
any pattern q that has fewer terminal symbols in the
source CFG skeleton than p. For example, prefer
"take:V:1 a walk" to "take:V:1 NP" if these patterns
give the VP charts with the same span.

3. Prefer a pattern p which does not violate any
head constraint over those which violate a head
constraint.

4. Prefer the shortest derivation sequence for each
input substring. A pattern for a larger domain of
locality tends to give a shorter derivation
sequence.
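
The first preference defines a pairwise relation rather than a total order, which a small sketch makes visible. The encoding of a source skeleton as a list of (symbol, head) pairs and the function name `more_specific` are assumptions made for this example.

```python
def more_specific(p, q):
    """True if pattern p is strictly more specific than q (preference 1).

    p, q -- source sides as lists of (symbol, head or None).
    """
    same_skeleton = [s for s, _ in p] == [s for s, _ in q]
    # p must carry every head constraint that q carries.
    covers = all(hq is None or hp == hq
                 for (_, hp), (_, hq) in zip(p, q))
    return same_skeleton and covers and p != q


p = [("V", "leave"), ("NP", "house")]  # leave:V:1 house:NP -> VP:1
q = [("V", "leave"), ("NP", None)]     # leave:V:1 NP -> VP:1
r = [("V", None), ("NP", "house")]     # V:1 house:NP -> VP:1

print(more_specific(p, q))                       # True
print(more_specific(q, r), more_specific(r, q))  # False False
```

The last line shows two patterns that are incomparable, which is one reason the text later turns to numeric weights for a linear ranking.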

These preferences can be expressed as numeric values
(cost) for patterns.[12] Thus, our strategy favors
lexicalized (or head-constrained) and collocational
patterns, which is exactly what we are going to
achieve with pattern-based MT. Selection of patterns
in the derivation sequence accompanies the
construction of a target chart. Link constraints are
propagated from source to target derivation trees.
This is basically a bottom-up procedure.

Since the number M of distinct pairs (X,w), for a
nonterminal symbol X and a subsequence w of input
string s, is bounded by $Kn^2$, we can compute the
m-best choice of pattern candidates for every
inactive chart in time $O(|T|Kn^3)$, as claimed by
Maruyama (Maruyama, 1993) and Schabes and Waters
(Schabes and Waters, 1995). Here, K is the number of
distinct nonterminal symbols in T, and n is the size
of the input string. Note that the head constraints
associated with the source CFG rules can be
incorporated in the parsing algorithm, since the
number of triples (X,w,h), where h is a head of X,
is bounded by $Kn^3$. We can modify the predict(),
complete(), and scan() procedures to run in
$O(|T|Kn^4)$ while checking the source head
constraints. Construction of the target charts, if
possible, on the basis of the m best candidate
patterns for each source chart takes $O(Kn^2 m)$
time. Here, m can be larger than $2^n$ if we
generate every possible translation.

The reader should note critical differences between 
lexicalized grammar rules (in the sense of LTAG and 
TIG) and translation patterns when they are used 
for MT. 

Firstly, a pattern is not necessarily lexicalized. An 
economical way of organizing translation patterns 
is to include non-lexicalized patterns as "default" 
translation rules. 

Secondly, lexicalization might increase the size of
STAG grammars (in particular, compositional grammar
rules such as ADJP NP -> NP) considerably when a
large number of phrasal variations (adjectives,
verbs in present participle form, various numeric
expressions, and so on), multiplied by the number of
their translations, are associated with the ADJP
part. The notion of structure sharing (Vijay-Shanker
and Schabes, 1992) may have to be extended from
lexical to phrasal structures, as well as from
monolingual to bilingual structures.

Thirdly, a translation pattern can omit the tree
structure of a collocation, and leave it as just a
sequence of terminal symbols. The simplicity of this
helps users to add patterns easily, although precise
description of syntactic dependencies is lost.

11 This strategy is similar to that of
transfer-driven MT (TDMT) (Furuse and Iida, 1994).
TDMT, however, is based on a combination of
declarative/procedural knowledge sources for MT, and
no clear computational properties have been
investigated.

12 A similar preference can be defined for the
target part of each pattern, but we found many
counter-examples, where the number of nonterminal
symbols shows no specificity of the patterns, in the
target part of English-to-Japanese translation
patterns. Therefore, only the head constraint
violation in the target part is accounted for in our
prototype.

4 Features and Agreements 

Translation patterns can be enhanced with
unification and feature structures to give patterns
additional power for describing gender, number,
agreement, and so on. Since the descriptive power of
unification-based grammars is considerably greater
than that of CFG (Berwick, 1982), feature structures
have to be restricted to maintain the efficiency of
parsing and generation algorithms. Shieber and
Schabes briefly discuss the issue (Shieber and
Schabes, 1990). We can also extend translation
patterns as follows:

Each nonterminal node in a pattern can be 
associated with a fixed-length vector of bi- 
nary features. 

This will enable us to specify such syntactic
dependencies as agreement and subcategorization in
patterns. Unification of binary features, however,
is much simpler than general feature-structure
unification: unification of a feature-value pair
succeeds only when the pair is either (0,0) or
(1,1). Since the feature vector has a fixed length,
unification of two feature vectors is performed in
constant time. For example, the patterns[13]

V:1:+TRANS NP:2 -> VP:1    VP:1 <- V:1:+TRANS NP:2
V:1:+INTRANS -> VP:1       VP:1 <- V:1:+INTRANS

are unifiable with transitive and intransitive
verbs, respectively. We can also distinguish local
and head features, as postulated in HPSG. A
simplified version of verb subcategorization is then
encoded as

VP:1:+TRANS-OBJ NP:2 -> VP:1:+OBJ
VP:1:+OBJ <- VP:1:+TRANS-OBJ NP:2

where "-OBJ" is a local feature for head VPs in
LHSs, while "+OBJ" is a local feature for VPs in the
RHSs. Unification of a local feature with +OBJ
succeeds since it is not bound.

13 Again, these patterns can be mapped to a weakly
equivalent set of CFG rules. See GPSG (Gazdar,
Pullum, and Sag, 1985) for more details.
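
Constant-time unification of fixed-length binary feature vectors can be sketched with two bit masks per vector, one recording which features are bound and one recording their values. The feature inventory, the `parse` helper, and the mask encoding are assumptions made for this example; unbound features are taken to unify with anything, in line with the remark about unbound local features above.

```python
FEATURES = ["TRANS", "INTRANS", "OBJ", "FIN"]  # hypothetical inventory
BIT = {name: 1 << k for k, name in enumerate(FEATURES)}


def parse(spec):
    """Turn a specification such as '+TRANS-OBJ' into (bound, value)."""
    bound = value = 0
    sign, name = None, ""
    for ch in spec + "+":          # the sentinel flushes the last feature
        if ch in "+-":
            if name:
                bound |= BIT[name]
                if sign == "+":
                    value |= BIT[name]
            sign, name = ch, ""
        else:
            name += ch
    return bound, value


def unify(a, b):
    """Unify two feature vectors; return the result, or None on a clash."""
    (bound_a, val_a), (bound_b, val_b) = a, b
    # A clash is a feature bound on both sides with different values.
    if (val_a ^ val_b) & bound_a & bound_b:
        return None
    return bound_a | bound_b, val_a | val_b


print(unify(parse("+TRANS-OBJ"), parse("+OBJ")))        # None (clash on OBJ)
print(unify(parse("+TRANS"), parse("+OBJ")) is not None)  # True
```

Because the vector length is fixed, each call costs a handful of machine-word operations regardless of how many features are bound.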

Agreement on subjects (nominative NPs) and
finite-form verbs (VPs, excluding the BE verb) is
disjunctively specified as

NP:1:+NOMI+3RD+SG VP:2:+FIN+3SG
NP:1:+NOMI+3RD+PL VP:2:+FIN-3SG
NP:1:+NOMI-3RD VP:2:+FIN-3SG
NP:1:+NOMI VP:2:+FIN+PAST

which is collectively expressed as

NP:1:*AGRS VP:2:*AGRV

Here, *AGRS and *AGRV are a pair of aggregate
unification specifiers that succeeds only when one
of the above combinations of the feature values is
unifiable.
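
The aggregate specifiers can be read as a table lookup: agreement succeeds if any row of the table unifies with both the subject and the verb. The sketch below uses dictionaries instead of bit vectors for readability. The lexical entries are assumptions, and they deliberately bind every feature mentioned in the table (for example PL and PAST), since an unbound feature would unify with either value.

```python
AGREEMENT = [  # (subject features, verb features), one row per disjunct
    ({"NOMI": 1, "3RD": 1, "SG": 1}, {"FIN": 1, "3SG": 1}),
    ({"NOMI": 1, "3RD": 1, "PL": 1}, {"FIN": 1, "3SG": 0}),
    ({"NOMI": 1, "3RD": 0},          {"FIN": 1, "3SG": 0}),
    ({"NOMI": 1},                    {"FIN": 1, "PAST": 1}),
]


def unifiable(spec, features):
    """A spec unifies unless a feature bound on both sides differs."""
    return all(features.get(k, v) == v for k, v in spec.items())


def agrees(np, vp):
    """*AGRS / *AGRV: succeed if any listed combination unifies."""
    return any(unifiable(s, np) and unifiable(v, vp) for s, v in AGREEMENT)


# Hypothetical, fully specified lexical entries.
he = {"NOMI": 1, "3RD": 1, "SG": 1, "PL": 0}
they = {"NOMI": 1, "3RD": 1, "SG": 0, "PL": 1}
knows = {"FIN": 1, "3SG": 1, "PAST": 0}
know = {"FIN": 1, "3SG": 0, "PAST": 0}
knew = {"FIN": 1, "3SG": 0, "PAST": 1}

print(agrees(he, knows), agrees(he, know), agrees(they, know))
```

With these entries the first and third checks succeed and the second fails; past-tense forms agree with any nominative subject through the last row.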

Another way to extend our grammar formalism is 
to associate weights with patterns. It is then possi- 
ble to rank the matching patterns according to a lin- 
ear ordering of the weights rather than the pairwise 
partial ordering of patterns described in the previ- 
ous section. In our prototype system, each pattern 
has its original weight, and according to the prefer- 
ence measurement described in the previous section, 
a penalty is added to the weight to give the effective 
weight of the pattern in a particular context. Pat- 
terns with the least weight are to be chosen as the 
most preferred patterns. 
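
The ranking by effective weight can be sketched as follows. The additive form of the penalty, the penalty size, and the example weights are all assumptions chosen for illustration; the text above only states that a penalty is added to the original weight and that the least weight wins.

```python
def effective_weight(pattern, violations, penalty=10.0):
    """Original weight plus an assumed fixed penalty per violation."""
    return pattern["weight"] + penalty * violations


candidates = [
    # (pattern, number of violated preferences in this context)
    ({"name": "VP ADVP -> VP", "weight": 5.0}, 0),
    ({"name": "know:VP well -> VP", "weight": 1.0}, 0),
    ({"name": "leave:VP behind -> VP", "weight": 1.0}, 1),
]

# Patterns with the least effective weight are preferred.
ranked = sorted(candidates, key=lambda c: effective_weight(*c))
best = ranked[0][0]["name"]
print(best)  # know:VP well -> VP
```

Giving user-defined patterns a uniformly low original weight is then enough to make them win over default patterns whenever both apply without violations.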

Numeric weights for patterns are extremely useful as
a means of assigning higher priorities uniformly to
user-defined patterns. Statistical training of
patterns can also be incorporated to calculate such
weights systematically (Fujisaki et al., 1989).

Figure 1 shows a sample translation of the input
"He knows me well," using the following patterns.

NP:1:*AGRS VP:1:*AGRS -> S:1
S:1 <- NP:1:*AGRS VP:1:*AGRS ... (a)

VP:1 ADVP:2 -> VP:1
VP:1 <- VP:1 ADVP:2 ... (b)

know:VP:1:+OBJ well -> VP:1
VP:1 <- connaitre:VP:1:+OBJ bien ... (c)

V:1 NP:2 -> VP:1:+OBJ
VP:1:+OBJ <- V:1 NP:2:-PRO ... (d)

V:1 NP:2 -> VP:1:+OBJ
VP:1:+OBJ <- NP:2:+PRO V:1 ... (e)

To simplify the example, let us assume that we have
the following preterminal rules:

he -> NP:+PRO+NOMI+3RD+SG
NP:+PRO+NOMI+3RD+SG <- il ... (f)

me -> NP:+PRO+CAUS+SG-3RD
NP:+PRO+CAUS+SG-3RD <- me ... (g)

knows -> V:+FIN+3SG
V:+FIN+3SG <- sait ... (h)

knows -> V:+FIN+3SG
V:+FIN+3SG <- connait ... (i)

well -> ADVP    ADVP <- bien ... (j)
well -> ADVP    ADVP <- beaucoup ... (k)

In the above example, the Earley-based algorithm 
with source CFG rules is used in Phase 1. In Phase 
2, head and link constraints are examined, and unifi- 
cation of feature structures is performed by using the 
charts obtained in Phase 1. Candidate patterns are 
ordered by their weights and preferences. Finally, 
in Phase 3, the target charts are built to generate 
translations based on the selected patterns. 

5 Integration of Bilingual Corpora 

Integration of translation patterns with translation 
examples, or bilingual corpora, is the most impor- 
tant extension of our framework. There is no dis- 
crete line between patterns and bilingual corpora. 
Rather, we can view them together as a uniform 
set of translation pairs with varying degrees of lex- 
icalization. Sentence pairs in the corpora, however, 
should not be just added as patterns, since they are 
often redundant, and such additions contribute to 
neither acquisition nor refinement of non-sentential 
patterns. 

Therefore, we have been testing the integration 
method with the following steps. Let T be a set of 
translation patterns, B be a bilingual corpus, and 
(s,t) be a pair of source and target sentences. 

1. [Correct Translation] If T can translate s into
t, do nothing.

2. [Competitive Situation] If T can translate s into
t' ($t \ne t'$), do the following:

(a) [Lexicalization] If there is a paired derivation
sequence Q of (s,t) in T, create a new pattern p'
for a pattern p used in Q such that every
nonterminal symbol X in p with no head constraint is
associated with h:X in p', where the head h is
instantiated in X of p. Add p' to T if it is not
already there. Repeat the addition of such patterns,
and assign low weights to them until the refined
sequence Q becomes the most likely translation of s.
For example, add

leave:VP:1:+OBJ considerably:ADVP:2 -> VP:1
VP:1 <- laisser:VP:1:+OBJ considérablement:ADVP:2

if the existing VP ADVP pattern does not give a
correct translation.

(b) [Addition of New Patterns] If there is no such
paired derivation sequence, add specific patterns,
if possible, for idioms and collocations that are
missing in T, or add the pair (s,t) to T as a
translation pattern. For example, add

leave:VP:1:+OBJ behind -> VP:1
VP:1 <- laisser:VP:1:+OBJ

if the phrase "leave it behind" is not correctly
translated.

3. [Translation Failure] If T cannot translate s at
all, add the pair (s,t) to T as a translation
pattern.

Input: He knows me well

Phase 1: Source Analysis

[0 1] He -> (f) NP
    (active arc [0 1] (a) NP.VP)
[1 2] knows -> (h) V, (i) V
    (active arcs [1 2] (d) V.NP,
                 [1 2] (e) V.NP)
[2 3] me -> (g) NP
    (inactive arcs [1 3] (d) V NP,
                   [1 3] (e) V NP)
[1 3] knows me -> (d), (e) VP
    (inactive arc [0 3] (a) NP VP,
     active arcs [1 3] (b) VP.ADVP,
                 [1 3] (c) VP.well)
[0 3] He knows me -> (a) S
[3 4] well -> (j) ADVP, (k) ADVP
    (inactive arcs [1 4] (b) VP ADVP,
                   [1 4] (c) VP well)
[1 4] knows me well -> (b), (c) VP
    (inactive arc [0 4] (a) NP VP)
[0 4] He knows me well -> (a) S

Phase 2: Constraint Checking

[0 1] He -> (f) NP
[1 2] knows -> (h) V, (i) V
[2 3] me -> (g) NP
[1 3] knows me -> (e) VP
    (pattern (d) fails)
[0 3] He knows me -> (a) S
[3 4] well -> (j) ADVP, (k) ADVP
[1 4] knows me well -> (b), (c) VP
    (preference ordering (c), (b))
[0 4] He knows me well -> (a) S

Phase 3: Target Generation

[0 4] He knows me well -> (a) S
[0 1] He -> il
[1 4] knows me well -> (c) VP
      well -> bien
[1 3] knows me -> (e) VP
[1 2] knows -> connait
    ((h) violates a head constraint)
[2 3] me -> me

Translation: il me connait bien

Figure 1: Sample Translation
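
The control flow of the integration procedure (correct translation, competitive situation, translation failure) can be sketched as a single pass over the corpus. Everything concrete here is hypothetical: the three callables stand in for the real translation system, patterns are modelled as a plain set, and the repeated re-weighting in the lexicalization case is not modelled.

```python
def integrate(patterns, corpus, translate, has_paired_derivation, lexicalize):
    """One pass of the three-case integration procedure over a corpus.

    Hypothetical hooks standing in for the real system:
      translate(patterns, s)                -> best translation or None
      has_paired_derivation(patterns, s, t) -> bool
      lexicalize(patterns, s, t)            -> set of more specific patterns
    """
    log = []
    for s, t in corpus:
        result = translate(patterns, s)
        if result == t:                          # 1. correct translation
            log.append((s, "correct"))
        elif result is None:                     # 3. translation failure
            patterns.add((s, t))
            log.append((s, "failure"))
        elif has_paired_derivation(patterns, s, t):
            patterns |= lexicalize(patterns, s, t)   # 2(a) lexicalization
            log.append((s, "lexicalized"))
        else:                                    # 2(b) add new patterns
            patterns.add((s, t))
            log.append((s, "added"))
    return log


def toy_translate(patterns, s):
    # Toy stand-in: exact sentence lookup, alphabetically first target.
    targets = sorted(t for (src, t) in patterns if src == s)
    return targets[0] if targets else None


patterns = {("hello", "bonjour")}
corpus = [("hello", "bonjour"), ("thanks", "merci"), ("hello", "salut")]
log = integrate(patterns, corpus, toy_translate,
                lambda p, s, t: False, lambda p, s, t: set())
print(log)
```

In a real setting the interesting work is inside `lexicalize`, which must pick the patterns in the paired derivation to specialize and assign weights until the desired translation becomes the preferred one.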

The grammar acquisition scheme described above has
not yet been automated, but has been manually
simulated for a set of 770 English-Japanese simple
sentence pairs designed for use in MT system
evaluation, which is available from JEIDA (the Japan
Electronic Industry Development Association, 1995),
including:



#100: Any question will be welcomed. 
#200: He kept calm in the face of great 
danger. 

#300: He is what is called "the man in the 
news" . 

#400: Japan registered a trade deficit of 
$101 million, reflecting the country's eco- 
nomic sluggishness, according to govern- 
ment figures. 

#500: I also went to the beach 2 weeks 
earlier. 

At an early stage of grammar acquisition, [Addition 
of New Patterns] was primarily used to enrich 
the set T of patterns, and many sentences were un- 
ambiguously and correctly translated. At a later 
stage, however, JEIDA sentences usually gave sev- 
eral translations, and [Lexicalization] with care- 
ful assignment of weights was the most critical task. 
Although these sentences are intended to test a sys- 
tem's ability to translate one basic linguistic phe- 
nomenon in each simple sentence, the result was 
strong evidence for our claim. Over 90% of JEIDA 
sentences were correctly translated. Among the fail- 
ures were: 

#95: I see some stamps on the desk.
#171: He is for the suggestion, but I'm against it.
#244: She made him an excellent wife.
#660: He painted the walls and the floor white.

Some (prepositional and sentential) attachment
ambiguities need to be resolved on the basis of
semantic information, and scoping of coordinated
structures would have to be determined by using not
only collocational patterns but also some measures
of balance and similarities among constituents.

6 Conclusions and Future Work 

Some assumptions about patterns should be
re-examined when we extend the definition of
patterns. The notion of head constraints may have to
be extended into one of a set membership constraint
if we need to handle coordinated structures (Kaplan
and Maxwell III, 198?). Some light-verb phrases
cannot be correctly translated without "exchanging"
several feature values between the verb and its
object. A similar problem has been found in be-verb
phrases.

Grammar acquisition and corpus integration are
fundamental issues, but automation of these
processes (Watanabe, 1993) is still not complete.
Development of an efficient translation algorithm,
not just an efficient parsing algorithm, will make a
significant contribution to research on synchronized
grammars, including STAGs and our PCFGs.